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Abstract 

This paper presents a corpus-based approach to 
word sense disambiguation that builds an ensemble 
of Naive Bayesian classifiers, each of which is based 
on lexical features that represent co-occurring words 
in varying sized windows of context. Despite the 
simplicity of this approach, empirical results disam- 
biguating the widely studied nouns line and interest 
show that such an ensemble achieves accuracy rival- 
ing thg hrvat prpvimisly pnl-ilishnH rosnlts 



1 Introduction 

Word sense disambiguation is often cast as a prob- 
lem in supervised learning, where a disambiguator is 
induced from a corpus of manually sense-tagged text 
using methods from statistics or machine learning. 
These approaches typically represent the context in 
which each sense-tagged instance of a word occurs 
with a set of linguistically motivated features. A 
learning algorithm induces a representative model 
from these features which is employed as a classifier 
to perform disambiguation. 

This paper presents a corpus-based approach that 
results in high accuracy by combining a number of 
very simple classifiers into an ensemble that per- 
forms disambiguation via a majority vote. This is 
motivated by the observation that enhancing the fea- 
ture set or learning algorithm used in a corpus-based 
approach does not usually improve disambiguation 
accuracy beyond what can be attained with shallow 
lexical features and a simple supervised learning al- 
gorithm. 



For example, a Naive Bayesian classifier (Duda 
and Hart, 1973 ) is based on a blanket assumption 
about the interactions among features in a sense- 
tagged corpus and does not learn a representative 
model. Despite making such an assumption, this 
proves to be among the most accurate techniques 
in comparative studies of corpus-ba sed word sense 
disambi guation metho d ologies (e.g., ( Leacock et al 



1993|), (jMooncy, 1996|) , flNg and Lee, 19961 ), ( pc~ 

jen and Bruce, 1997| )). These studies represent the 



tribution of each type of feature to o verall accuracy 
is analyzed (eg. ( Ng and Lee, 199^ )), shallow lexi- 
cal features such as co-occurrences and collocations 
prove to be stronger contributors to accuracy than 
do deeper, linguistically motivated features such as 
part-of-speech and verb-object relationships. 

It has also been shown that the combined accuracy 
of an ensemble of multiple classifiers is often signifi- 
cantly greater than that of any of the indi vidual clas- 



sifiers that make up the ensemble (e.g., (Dietterich 



1997)). In natural language processing, ensemble 



context in which an ambiguous word occurs with a 
wide variety of features. However, when the con- 



techniques have been su ccessfully applied to part- 
of-speech tagg ing (e.g., (Brill and Wu, 199S| )) and 
parsing (e.g., ( Henderson and Brill, 1999| )). When 
combined with a history of disambiguation success 
using shallow lexical features and Naive Bayesian 
classifiers, these findings suggest that word sense dis- 
ambiguation might best be improved by combining 
the output of a number of such classifiers into an 
ensemble. 

This paper begins with an introduction to the 
Naive Bayesian classifier. The features used to rep- 
resent the context in which ambiguous words occur 
are presented, followed by the method for selecting 
the classifiers to include in the ensemble. Then, the 
line and interest data is described. Experimental re- 
sults disambiguating these words with an ensemble 
of Naive Bayesian classifiers are shown to rival pre- 
viously published results. This paper closes with a 
discussion of the choices made in formulating this 
methodology and plans for future work. 

2 Naive Bayesian Classifiers 

A Naive Bayesian classifier assumes that all the fea- 
ture variables representing a problem are condition- 
ally independent given the value of a classification 
variable. In word sense disambiguation, the context 
in which an ambiguous word occurs is represented by 
the feature variables (Fi, F%, ■ ■ ■ , F n ) and the sense 
of the ambiguous word is represented by the classi- 
fication variable (S). In this paper, all feature vari- 
ables Fi are binary and represent whether or not a 
particular word occurs within some number of words 
to the left or right of an ambiguous word, i.e., a win- 



dow of context. For a Naive Bayesian classifier, the 
joint probability of observing a certain combination 
of contextual features with a particular sense is ex- 
pressed as: 

n 

p(F 1 ,F 2 ,...,F n: S)= P (S)l[p(F l \S) 

i=l 

The parameters of this model are p{S) and 
p(Fi\S). The sufficient statistics, i.e., the summaries 
of the data needed for parameter estimation, are the 
frequency counts of the events described by the in- 
terdependent variables (Fi,S). In this paper, these 
counts are the number of sentences in the sense- 
tagged text where the word represented by Fj oc- 
curs within some specified window of context of the 
ambiguous word when it is used in sense S. 

Any parameter that has a value of zero indicates 
that the associated word never occurs with the spec- 
ified sense value. These zero values are smoothed 
by assigning them a very small default probability. 
Once all the parameters have been estimated, the 
model has been trained and can be used as a clas- 
sifier to perform disambiguation by determining the 
most probable sense for an ambiguous word, given 
the context in which it occurs. 

2.1 Representation of Context 

The contextual features used in this paper are bi- 
nary and indicate if a given word occurs within some 
number of words to the left or right of the ambigu- 
ous word. No additional positional information is 
contained in these features; they simply indicate if 
the word occurs within some number of surrounding 
words. 

Punctuation and capitalization are removed from 
the windows of context. All other lexical items are 
included in their original form; no stemming is per- 
formed and non-content words remain. 

This representation of context is a variation on 
the bag-of-words feature set, where a single window 
of context includes words that occur to both the left 
and right of the ambiguous word. An early use o f 
this representation is described in ( |Gale et al., 1992 ) , 
where word sense disambiguation is performed with 
a Naive Bayesian classifier. The work in this pa- 
per differs in that there are two windows of context, 
one representing words that occur to the left of the 
ambiguous word and another for those to the right. 

2.2 Ensembles of Naive Bayesian Classifiers 

The left and right windows of context have nine dif- 
ferent sizes; 0, 1, 2, 3, 4, 5, 10, 25, and 50 words. 
The first step in the ensemble approach is to train a 
separate Naive Bayesian classifier for each of the 81 
possible combination of left and right window sizes. 

Naive_Bayes (l,r) represents a classifier where the 
model parameters have been estimated based on fre- 



quency counts of shallow lexical features from two 
windows of context; one including I words to the left 
of the ambiguous word and the other including r 
words to the right. Note that Naive_Bayes (0,0) in- 
cludes no words to the left or right; this classifier acts 
as a majority classifier that assigns every instance of 
an ambiguous word to the most frequent sense in 
the training data. Once the individual classifiers are 
trained they are evaluated using previously held-out 
test data. 

The crucial step in building an ensemble is select- 
ing the classifiers to include as members. The ap- 
proach here is to group the 81 Naive Bayesian clas- 
sifiers into general categories representing the sizes 
of the windows of context. There are three such 
ranges; narrow corresponds to windows 0, 1 and 2 
words wide, medium to windows 3, 4, and 5 words 
wide, and wide to windows 10, 25, and 50 words 
wide. There are nine possible range categories since 
there are separate left and right windows. For exam- 
ple, Naive_Bayes(l,3) belongs to the range category 
(narrow, medium) since it is based on a one word 
window to the left and a three word window to the 
right. The most accurate classifier in each of the 
nine range categories is selected for inclusion in the 
ensemble. Each of the nine member classifiers votes 
for the most probable sense given the particular con- 
text represented by that classifier; the ensemble dis- 
ambiguates by assigning the sense that receives a 
majority of the votes. 

3 Experimental Data 

The line data was created by ( Leacock et al., 1993j ) 



by tagging every occurrence of line in the ACL/DCI 
Wall Street Journal corpus and the American Print- 
ing House for the Blind corpus with one of six pos- 
sible WordNet senses. These senses and their fre- 
quency distribution are shown in Ta ble |l|. This dat a 
has since been used in studies by ( Mooncy, 1996|) 



[Towell and Voorhees, 1998), and ( [Leacock et al 



1998). In that work, as well as in this paper, a subset 



of the corpus is utilized such that each sense is uni- 
formly distributed; this reduces the accuracy of the 
majority classifier to 17%. The uniform distribution 
is created by randomly sampling 349 sense-tagged 
examples from each sense, resulting in a training cor- 
pus of 2094 sense-tagged sentences. 

The interest data was created by (Bruce and 



Wiebe, 1994) by tagging all occurrences of interest 



in the ACL/DCI Wall Street Journal corpus with 
senses from the Longman Dictionary of Contempo- 
rary English. This data set was subsequently used 
for word sense d isambiguation experi ments by (Ng 
and Lee, 199^), (|Pedersen et al., 1997| ), and ( [Peder 



en and Bruce, 1997 ). The previous studies and this 
paper use the entire 2,368 sense-tagged sentence cor- 
pus in their experiments. The senses and their fre- 



sense 


count 


product 


2218 


written or spoken text 


405 


telephone connection 


429 


formation of people or things; queue 


349 


an artificial division; boundary 


376 


a thin, flexible object; cord 


371 


total 


4148 



Table 1: Distribution of senses for line - the exper- 
iments in this paper and previous work use a uni- 
formly distributed subset of this corpus, where each 
sense occurs 349 times. 



sense 


count 


money paid for the use of money 


1252 


a share in a company or business 


500 


readiness to give attention 


361 


advantage, advancement or favor 


178 


activity that one gives attention to 


66 


causing attention to be given to 


11 


total 


2368 



Table 2: Distribution of senses for interest - the ex- 
periments in this paper and previous work use the 
entire corpus, where each sense occurs the number 
of times shown above. 



quency distribution are shown in Table Unlike 
line, the sense distribution is skewed; the majority 
sense occurs in 53% of the sentences, while the small- 
est minority sense occurs in less than 1%. 

4 Experimental Results 

Eighty-one Naive Bayesian classifiers were trained 
and tested with the line and interest data. Five- 
fold cross validation was employed; all of the sense- 
tagged examples for a word were randomly shuffled 
and divided into five equal folds. Four folds were 
used to train the Naive Bayesian classifier while the 
remaining fold was randomly divided into two equal 
sized test sets. The first, devtest, was used to eval- 
uate the individual classifiers for inclusion in the en- 
semble. The second, test, was used to evaluate the 
accuracy of the ensemble. Thus the training data 
for each word consists of 80% of the available sense- 
tagged text, while each of the test sets contains 10%. 

This process is repeated five times so that each 
fold serves as the source of the test data once. The 
average accuracy of the individual Naive Bayesian 
classifiers across the five folds is reported in Tables 
H and |]. The standard deviations were between .01 
and .025 and are not shown given their relative con- 
sistency. 



Each classifier is based upon a distinct representa- 
tion of context since each employs a different com- 
bination of right and left window sizes. The size 
and range of the left window of context is indicated 
along the horizontal margin in Tables | and | while 
the right window size and range is shown along the 
vertical margin. Thus, the boxes that subdivide each 
table correspond to a particular range category. The 
classifier that achieves the highest accuracy in each 
range category is included as a member of the ensem- 
ble. In case of a tie, the classifier with the smallest 
total window of context is included in the ensemble. 

The most accurate single classifier for line is 
Naive _Bayes (4,25), which attains accuracy of 84% 
The accuracy of the ensemble created from the most 
accurate classifier in each of the range categories is 
88%. The single most accurate classifier for interest 
is Naive_Bayes(4,l), which attains accuracy of 86% 
while the ensemble approach reaches 89%. The in- 
crease in accuracy achieved by both ensembles over 
the best individual classifier is statistically signifi- 
cant, as judged by McNemar's test with p = .01. 

4.1 Comparison to Previous Results 

These experiments use the same sense-tagged cor- 
pora for interest and line as previous studies. Sum- 
maries of previous results in Tables |5| and ^ show 
that the accuracy of the Naive Bayesian ensemble 
is comparable to that of any other approach. How- 
ever, due to variations in experimental methodolo- 
gies, it can not be concluded that the differences 
among the most accurate methods are statistically 
significant. For example, in this work five-fold cr oss 
validation is em ployed to assess accuracy while ( Ng 
and Lee, 1996) train and test using 100 randomly 
sampled sets of data. Similar differences in train- 
ing and testing methodology exist among the other 
studies. Still, the results in this paper are encourag- 
ing due to the simplicity of the approach. 

4.1.1 Interest 



The interest data was first studied by ( Bruce and 
Wiebe, 1994). They employ a representation of 



context that includes the part-of-speech of the two 
words surrounding interest, a morphological feature 
indicating whether or not interest is singular or plu- 
ral, and the three most statistically significant co- 
occurring words in the sentence with interest, as de- 
termined by a test of independence. These features 
are abbreviated as p-o-s, morph, and co-occur in 
Table S. A decomposable probabilistic model is in- 
duced from the sense-tagged corpora using a back- 
ward sequential search where candidate models are 
evaluated with the log-likelihood ratio test. The se- 
lected model was used as a probabilistic classifier on 
a hcld-out set of test data and achieved accuracy of 
78%. 



The interest data was included in a study by ( Ng 
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wide 


25 


.63 


.74 


.80 


.82 


.84 


83 


.83 


.83 


.83 




10 


.62 


.75 


.81 


.82 


.83 


83 


.83 


.83 


.84 




5 


.61 


.75 


.80 


.81 


.82 


82 


.82 


.82 


.83 


medium 


4 


.60 


.73 


.80 


.82 


.82 


82 


.82 


.82 


.82 




3 


.58 


.73 


.79 


.82 


.83 


83 


.82 


.81 


.82 




2 


.53 


.71 


.79 


.81 


.82 


82 


.81 


.81 


.81 


narrow 


1 


.42 


.68 


.78 


.79 


.80 


79 


.80 


.81 


.81 







.14 


.58 


.73 


.77 


.79 


79 


.79 


.79 


.80 









1 

narrow 


2 


3 


4 

medium 


5 


10 


25 
wide 


50 



Table 3: Accuracy of Naive Bayesian classifiers for line evaluated with the devtest data. The italicized 
accuracies are associated with the classifiers included in the ensemble, which attained accuracy of 88% when 
evaluated with the test data. 
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.73 


.83 


.85 


.86 


.85 


85 


.83 


.81 


.81 


medium 
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.72 


.83 


.85 


.85 
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.80 
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.70 


.84 


.86 


.86 


.86 


85 
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.81 


.80 
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.66 


.83 


.85 


.86 


.86 


84 


.83 


.80 


.80 


narrow 
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.63 


.82 


.85 


.85 


.86 


85 


.82 


.81 


.80 







.53 


.72 


.77 


.78 


.79 


77 


.77 


.76 


.75 









1 

narrow 


2 


3 


4 

medium 


5 


10 


25 
wide 


50 



Table 4: Accuracy of Naive Bayesian classifiers for interest evaluated with the devtest data. The italicized 
accuracies are associated with the classifiers included in the ensemble, which attained accuracy of 89% when 
evaluated with the test data. 



and Lee, 1996), who represent the context of an 
ambiguous word with the part-of-speech of three 



4.1.2 Line 



wuids Lu Lhu lull and lighL uf interest, a iiioipho- 



logical feature indicating if interest ia singular or 

plural, an unordered set of frequently occurring 
keywords that surround interest, local collocations 
that include interest, and verb-object syntactic re- 
lationships. These features are abbreviated p-o-s, 
morph, co-occur, collocates, and verb-obj in Table 
A nearest-neighbor classifier was employed and 
achieved an average accuracy of 87% over repeated 
trials using randomly drawn training and test sets. 



1 



(Pcdersen et al., 1997) and (Pedersen and Bruce 



1997) present studies that utilize the original Bruce 



and Wiebe feature set and include the interest data. 
The first compares a range of probabilistic model 
selection methodologies and finds that none outper- 
form the Naive Bayesian classifier, which attains ac — 



curacy of 74% , The second compares a range of ma- 

chine learning algorithms and finds that a decision 
tree learner (78%) and a Naive Bayesian classifier 
(74%) are most accurate. 



The line data was first studied by ( Leacock et al 



1993|). They evaluate the disambiguation accuracy 



of a Naive Bayesian classifier, a content vector, and 
a neural network. The context of an ambiguous 
word is represented by a bag-of-words where the 
window of context is two sentences wide. This fea- 
ture set is abbreviated as 2 sentence b-o-w in Table 
H When the Naive Bayesian classifier is evaluated 
words are not stemmed and capitalization remains. 
However, with the content vector and the neural net- 
work words are stemmed and words from a stop-list 
are removed. They report no significant differences 
in accuracy among the three approaches; the Naive 
Bayesian classifier achieved 71% accuracy, the con- 
tent vector 72%, and the neural network 76%. 



Th e line data was studied again by (Mooney. 
1996), where seven different machine learning 



methodologies are compared. All learning algo- 
rithms represent the context of an ambiguous word 
using the bag-of-words with a two sentence window 
of context. In these experiments words from a stop- 





accuracy 


method 


feature set 


Naive Bayesian Ensemble 


89% 


ensemble of 9 


varying left & right b-o-w 


Ng & Lee, 1996 


87% 


nearest neighbor 


p-o-s, morph, co-occur 
collocates, verb-obj 


Bruce & Wiebe, 1994 


78% 


model selection 


p-o-s, morph, co-occur 


Pedersen & Bruce, 1997 


78% 
74% 


decision tree 
naive bayes 


p-o-s, morph, co-occur 



Table 5: Comparison to previous results for interest 





accuracy 


method 


feature set 


Naive Bayesian Ensemble 


88% 


ensemble of 9 


varying left & right b-o-w 


Towell & Voorhess, 1998 


87% 


neural net 


local & topical b-o-w, p-o-s 


Leacock, Chodorow, & Miller, 1998 


84% 


naive bayes 


local & topical b-o-w, p-o-s 


Leacock, Towell, & Voorhees, 1993 


76% 
72% 
71% 


neural net 
content vector 
naive bayes 


2 sentence b-o-w 


Mooney, 1996 


72% 
71% 


naive bayes 
perceptron 


2 sentence b-o-w 



Table 6: Comparison to previous results for line 



list are removed, capitalization is ignored, and words 
are stemmed. The two most accurate methods in 
this study proved to be a Naive Bayesian classifier 
(72%) and a perceptron (71%). 



E 



The line data was rec ently r evisited by both ( Tow- 
11 and Voorhees, 199§| ) and ( Leacock ct al., 1998| ) 



The former take an ensemble approach where the 
output from two neural networks is combined; one 
network is based on a representation of local con- 
text while the other represents topical context. The 
latter utilize a Naive Bayesian classifier. In both 
cases context is represented by a set of topical and 
local features. The topical features correspond to 
the open-class words that occur in a two sentence 
window of context. The local features occur within a 
window of context three words to the left and right 
of the ambiguous word and include co-occurrence 
features as well as the part-of-speech of words in 
this window. These features are represente d as lo- 
cal fc topical b-o-w and p-o-s in Table |^. (Towell 
and Voorhees, 1998|) report accuracy of 87% while 
(Leacock ct al., 199S ) report accuracy of 84%. 



5 Discussion 

The word sense disambiguation ensembles in this pa- 
per have the following characteristics: 

• The members of the ensemble are Naive 
Bayesian classifiers, 

• the context in which an ambiguous word oc- 
curs is represented by co-occurrence features 



extracted from varying sized windows of sur- 
rounding words, 

• member classifiers are selected for the ensembles 
based on their performance relative to others 
with comparable window sizes, and 

• a majority vote of the member classifiers deter- 
mines the outcome of the ensemble. 

Each point is discussed below. 

5.1 Naive Bayesian classifiers 

The Naive Bayesian classifier has emerged as a con- 
sistently strong performer in a wide range of com- 
parative studies of machine learning methodologies. 
A recent survey of such results, as well as pos- 
sible explanations for its succ ess, is presented in 
(Domingos and Pazzani, 1997). A similar finding 
has emerged in word sense disambiguation, where 
a number of comparative studies have all reported 
that no method achieves significantly greater accu- 
racy than the Na ive Bayesian class ifie r (e.g., ( Lea- 



sock et al., 1993), (Mooney, 199€), (Ng and Lee 



1996), (Pedersen and Bruce, 1997)) 



In many ensemble approaches the member classi- 
fiers are learned with different algorithms that are 
trained with the same data. For example, an en- 
semble could consist of a decision tree, a neural net- 
work, and a nearest neighbor classifier, all of which 
are learned from exactly the same set of training 
data. This paper takes a different approach, where 
the learning algorithm is the same for all classifiers 



but the training data is different. This is motivated 
by the belief that there is more to be gained by vary- 
ing the representation of context than there is from 
using many different learning algorithms on the same 
data. This is especially true in this domain since the 
Naive Bayesian classifier has a history of success and 
since there is no generally agreed upon set of features 
that have been shown to be optimal for word sense 
disambiguation. 

5.2 Co occurrence features 

Shallow lexical features such as co-occurrences and 
collocations are recognized as potent sources of dis- 
ambiguation information. While many other con- 
textual features are often employed, it isn't clear 
that they offer substa ntial advantages. For exam- 
ple, ( Ng and Lee, 1996 ) report that local collocations 
alone achieve 80% accuracy disambiguating interest, 
while their full set of features result in 87%. Prelim- 
inary experiments for this paper used feature sets 
that included collocates, co-occurrences, part-of- 
speech and grammatical information for surrounding 
words. However, it was clear that no combination of 
features resulted in disambiguation accuracy signifi- 
cantly higher than that achieved with co-occurrence 
features. 

5.3 Selecting ensemble members 

The most accurate classifier from each of nine pos- 
sible category ranges is selected as a member of 
the ensemble. This is based on preliminary experi- 
ments that showed that member classifiers with sim- 
ilar sized windows of context often result in little or 
no overall improvement in disambiguation accuracy. 
This was expected since slight differences in window 
sizes lead to roughly equivalent representations of 
context and classifiers that have little opportunity 
for collective improvement. For example, an ensem- 
ble was created for interest using the nine classifiers 
in the range category (medium, medium). The ac- 
curacy of this ensemble was 84%, slightly less than 
the most accurate individual classifiers in that range 
which achieved accuracy of 86%. 

Early experiments also revealed that an ensemble 
based on a majority vote of all 81 classifiers per- 
formed rather poorly. The accuracy for interest was 
approximately 81% and line was disambiguated with 
slightly less than 80% accuracy. The lesson taken 
from these results was that an ensemble should con- 
sist of classifiers that represent as differently sized 
windows of context as possible; this reduces the im- 
pact of redundant errors made by classifiers that 
represent very similarly sized windows of context. 
The ultimate success of an ensemble depends on the 
ability to select classifiers that make complementary 
errors. This is discussed in the context of combin- 



ing part-of-speech taggers in (Brill and Wu, 1998) 



They provide a measure for assessing the comple- 



mentarity of errors between two taggers that could 
be adapted for use with larger ensembles such as the 
one discussed here, which has nine members. 

5.4 Disambiguation by majority vote 

In this paper ensemble disambiguation is based on a 
simple majority vote of the nine member classifiers. 

An alternative strategy is to weight each vote by 
the estimated joint probability found by the Naive 
Bayesian classifier. However, a preliminary study 
found that the accuracy of a Naive Bayesian ensem- 
ble using a weighted vote was poor. For interest, 
it resulted in accuracy of 83% while for line it was 
82%. The simple majority vote resulted in accuracy 
of 89% for interest and 88% for line. 

6 Future Work 

A number of issues have arisen in the course of this 
work that merit further investigation. 

The simplicity of the contextual representation 
can lead to large numbers of parameters in the Naive 
Bayesian model when using wide windows of con- 
text. Some combination of stop-lists and stemming 
could reduce the numbers of parameters and thus 
improve the overall quality of the parameter esti- 
mates made from the training data. 

In addition to simple co-occurrence features, the 
use of collocation features seems promising. These 
are distinct from co-occurrences in that they are 
words that occur in close proximity to the ambiguous 
word and do so to a degree that is judged statisti- 
cally significant. 

One limitation of the majority vote in this paper 
is that there is no mechanism for dealing with out- 
comes where no sense gets a majority of the votes. 
This did not arise in this study but will certainly 
occur as Naive Bayesian ensembles are applied to 
larger sets of data. 

Finally, further experimentation with the size of 
the windows of context seems warranted. The cur- 
rent formulation is based on a combination of intu- 
ition and empirical study. An algorithm to deter- 
mine optimal windows sizes is currently under de- 
velopment. 

7 Conclusions 

This paper shows that word sense disambiguation 
accuracy can be improved by combining a number 
of simple classifiers into an ensemble. A methodol- 
ogy for formulating an ensemble of Naive Bayesian 
classifiers is presented, where each member classifier 
is based on co-occurrence features extracted from 
a different sized window of context. This approach 
was evaluated using the widely studied nouns line 
and interest, which are disambiguated with accuracy 
of 88% and 89%, which rivals the best previously 
published results. 
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